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Abstract — We consider the problem of sparse sig- 
nal recovery from a small number of random projec- 
tions (measurements). This is a well known NP-hard 
to solve combinatorial optimization problem. A fre- 
quently used approach is based on greedy iterative 
procedures, such as the Matching Pursuit (MP) algo- 
rithm. Here, we discuss a fast GPU implementation 
of the MP algorithm, based on the recently released 
NVIDIA CUDA API and CUBLAS library. The re- 
sults show that the GPU version is substantially faster 
(up to 31 times) than the highly optimized CPU ver- 
sion based on CBLAS (GNU Scientific Library). 

Keywords: GPU programming, Nvidia CUDA, sparse 
signal recovery, random projections, matching pursuit 
algorithm 

1 Introduction 

Recently there has been an increasing interest in recov- 
ering sparse signals from their projection onto a small 
number of random vectors (see [l]-[7] and the references 
within). Sparse signal expansions represent or approxi- 
mate a signal using only a small number of elements from 
a given basis or dictionary. Unfortunately, the sparse re- 
covery problem is known to be a NP-hard combinato- 
rial optimization problem, requiring the enumeration of 
all possible collections of elements in a dictionary and 
searching for the smallest collection which best approxi- 
mates the signal. Several sub-optimal methods have been 
recently developed PQ-[Z], such that a wide range of ap- 
plications have benefited from the progress made in this 
area. These methods show good performance in finding 
the solution of the sparse approximation problem. How- 
ever, their major shortcoming resides in achieving suffi- 
cient computational speed, which limits their application 
to difficult real world applications which require heavy 
computational load. The recent improvements in perfor- 
mance of graphics hardware have made Graphics Pro- 
cessing Units (GPUs) strong candidates for approaching 
this problem. Recently, NVIDIA has released a general- 
purpose oriented API for its graphics hardware, called 
CUDA [gj. In addition, NVIDIA has developed CUBLAS 
which is a GPU optimized version of BLAS library (Ba- 
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sic Linear Algebra Subroutines) built on top of CUDA 
[9j. In this paper we discuss and evaluate the CUBLAS 
implementation of the Matching Pursuit (MP) algorithm 
[TO] . Although MP is not the most accurate algorithm for 
sparse signal recovery, it is still the fastest and most fre- 
quently used in practice, due to its relative simple formu- 
lation. Our numerical results show that the GPU version 
of MP is substantially faster (up to 31 times) than the 
highly optimized CPU version based on CBLAS (GNU 
Scientific Library) [11] . 

2 The Sparse Recovery Problem 

A high dimensional vector (signal) x = [xq, Xm-i] £ 
R M is A"-sparsc in R M if there exists a set of indices 
{mi, mjf} C {0, M — 1}, for small K -C M such 
that: 



7^0 if m6{mi,...,mi{} 
= if m^{mi,...,m]f} 



(1) 



We consider the following encoding/decoding problem: 

• the sparse vector x E R M is encoded in a smaller 
dimensional vector y 6 R N , K < N < M, using a 
randomly generated N x M matrix ^: 



y = *x e N < M, 
which is then submitted to a receiver; 



(2) 



• the receiver's decoding task consists in recovering the 
sparse vector x S R M , given the vector y £ R N and 
the random matrix \&. 

Thus, the coefficients of the vector y S R w are the pro- 
jections of the sparse vector x S R M on the vectors cor- 
responding to the rows of the N x M matrix St. Re- 
ciprocally, we may say that x S R M provides a sparse 
representation of y € in the redundant dictionary 
corresponding to the column vectors tp m of the N x M 
random matrix St: 



* =U>o\-\ip 



M-lJ 



(3) 



For convenience, we assume that the columns of the ran- 
dom matrix \& are normalized: 



\i>n 



1, 



0,...,M- 1. 



(4) 



Also, the analysis of algorithms for sparse signal recov- 
ery shows that the matrix ^ must satisfy the restricted 
isometry condition [1-6]: 

(l-4)||x||^<||*x||^<(l + 4)H2- (5) 

The restricted isometry constant 8 X is defined as the 
smallest constant for which this property holds for all K- 
sparse vectors x G R M . In order to use this condition in 
practice, one would need to be able to design a matrix sat- 
isfying the restricted isometry condition. Recently it has 
been shown that one can generate such a matrix with high 
probability, if the elements of the matrix are drawn in- 
dependently from certain probability distributions, such 
as a Gaussian distribution or a Bernoulli distribution pQ- 
[S] . This is a consequence of the fact that in high dimen- 
sions the probability mass of certain random variables 
concentrates strongly around their expectation. Also, re- 
cent theoretic considerations have shown that in order 
to achieve the restricted isometry condition, any N x M 
matrix \& must have at least N ~ cK \og(M/K) rows for 
some constant c in order for the observation y = \I/x to 
allow an accurate reconstruction of x pQ- [6]. 

Searching for the sparsest x in the dictionary \& that 
matches y leads to the ^ optimization problem: 

x = arg min llxIL s.t. SPx = y. (6) 

Here, ||x|| is the Iq norm, measuring the number of 
nonzero coefficients in the vector x. This combinato- 
rial optimization problem is NP-hard to solve and usu- 
ally the convexification of the objective function is intro- 
duced by replacing the Iq norm with the l\ norm [1-6] 

(Nil = Eli W): 

x = arg min llxIL s.t. \&x = y. (7) 

xER M 

The resulting optimization problem is known as Basis 
Pursuit (BP) and it can be solved using linear program- 
ming techniques whose computational complexities are 
polynomial pQ-[6]. However, the BP approach requires 
the solution of a very large convex, nonquadratic opti- 
mization problem, and therefore still suffers from high 
computational complexity. As an alternative, here we 
consider an iterative greedy approach based on the MP 
algorithm [TO] . MP tackles the problem by operating a lo- 
cal optimization, as opposed to BP's global optimization 
strategy. MP has been proven to achieve an accurate de- 
composition of the signal and it provides a low-complexity 
alternative to BP, but requires an unbounded number of 
iterations for convergence pQ- [7], [T2] , 

3 The Matching Pursuit Algorithm 

Matching Pursuit (MP) is an iterative heuristic algorithm 
which can be used to obtain approximate solutions of the 



sparse recovery problem |10j . Starting from an initial ap- 
proximation x = and residual r = y, the algorithm uses 
an iterative 'greedy' strategy to pick the columns ip m of 
that are the most strongly correlated with the resid- 
ual. Then, successively their contribution is subtracted 
from the residual, which this way can be made arbitrarily 
small. The pseudo-code of the MP algorithm is: 

1. Initialize the variables: 

t ^0,x^0,r^y,T. (8) 

2. While ||r|| 2 > e ||y|| 2 and t < T repeat: 

• Find i such that 

i <— arg max |(r,^j)|. (9) 
. |i w; 

• Update the estimate of the corresponding coef- 
ficient, the residual and the iteration counter: 

Xi <- Xi + (r, tpi) , (10) 

r <- r - {r, Vi) ^i, (11) 
i<-t+l. (12) 

3. Return x. 

The stopping criterion in the step 2 requires the residual 
to be smaller than some fraction e of the 'target' y. Also, 
the computation stops if the number of iterations t ex- 
ceed the maximum number allowed T. A shortcoming of 
the MP algorithm is that although the asymptotic con- 
vergence is guaranteed and it can be easily proven, the 
resulting approximation after any finite number of steps 
will in general be suboptimal. This shortcoming can be 
corrected by using the orthogonal version of MP, at a 
much higher computational cost [5], [7], [12] . 

For random dictionaries with dimensionality N x M, 
each inner product (r, tpi) requires N multiplications and 
N — 1 additions. Each iteration requires M such inner 
product computations. Also, performing enough MP iter- 
ations to get a small reconstruction ||r|| 2 < e||y|| 2 often 
means iterating T ~ M times. Therefore, the cost of 
computing the approximate solution x could be as high 
as 0(N 2 M 2 ) » 0(N 4 ), making MP very slow on high- 
dimensional signals. Also, we should stress once again 
that in the case of sparse signal recovery from random 
projections, the dictionaries are completely random and 
therefore one cannot apply acceleration techniques used 
for structured dictionaries (FFT, DCT, Gabor, Wavelet 
etc.) [13] . Thus, the hardware acceleration provided by 
a GPU implementation is probably the only solution to 
speed up computation in this particular case. 



2 



E <°'P> a 0.5 







P 




Figure 1: Sparse reconstruction error of the MP algo- 
rithm (the error grows from white (0%) to black (100%)). 

In order to determine the effectiveness of the MP algo- 
rithm, we have conducted the following simple experi- 
ment. The sparse signals x G R A/ were generated by 
drawing the K non-zero components from a uniform dis- 
tribution on [— 1, +1], Also, the random N x M matrix 
^ was drawn from a Bernoulli distribution: 

^ f +l/y~N with probability 1/2 

nm \ —1/yN with probability 1/2 

We have considered two parameters a = K/M and p — 
N/M and we have computed the relative reconstruction 
error: 

E(a, p) = ||x - xf 2 ||x||^ 2 x 100%, (14) 
as a function of a and p. 

The results for M = 1000, averaged over 100 trials, are 
shown in Figure 1. The stopping parameters were set to 
e = 10~ 7 and respectively T = N. The numerical results 
are in very good agreement with the previous theoreti- 
cal considerations, showing a logarithmic dependence be- 
tween p and tr. For a small number of projections (small 
p) the algorithm is unstable and the reconstruction error 
grows. By increasing the number of projections (large p) 
the MP algorithm is able to recover exactly the if-sparse 
signal x e R M . 

4 CPU vs GPU (BLAS vs CUBLAS) 

Due to their tremendous memory bandwidth and com- 
putational horsepower, GPUs are becoming an effi- 
cient alternative to solve computer-intensive applica- 
tions. The newly developed GPUs now include fully 
programmable processing units that follow a stream pro- 
gramming model and support vectorized single and dou- 
ble precision floating-point operations. For example, the 
CUDA computing environment provides a standard C 
like language interface to the NVIDIA GPUs [5]. The 



computation is distributed into sequential grids, which 
are organized as a set of thread blocks. The thread blocks 
are batches of threads that execute together, sharing lo- 
cal memories and synchronizing at specified barriers. An 
enormous number of blocks, each containing maximum 
512 threads, can be launched in parallel in the grid. 

In this paper we make use of CUBLAS, a recent imple- 
mentation of BLAS, developed by NVIDIA on top of the 
CUDA programming environment [9]. CUBLAS library 
provides functions for: 

• creating and destroying matrix and vector objects in 
GPU memory; 

• transferring data from CPU mainmemory to GPU 
memory; 

• executing BLAS on the GPU; 

• transferring data from GPU memory back to the 
CPU mainmemory. 

BLAS defines a set of fundamental operations on vec- 
tors and matrices which can be used to create optimized 
higher-level linear algebra functionality: 

• Level 1 BLAS perform scalar, vector and vector- 
vector operations; 

• Level 2 BLAS perform matrix-vector operations; 

• Level 3 BLAS perform matrix-matrix operations. 

Highly efficient implementations of BLAS exist for most 
current computer architectures and the specification of 
BLAS is widely adopted in the development of high qual- 
ity linear algebra software, such as LAPACK, IMKL and 
GNU Scientific Library (GSL) llj. We selected GSL 
CBLAS, for our host (CPU) implementation, due to its 
portability on various platforms (Windows/Linux/OSX, 
Intel/AMD) and because it is free and easy to use in com- 
bination with GCC (GNU Compiler). The GSL library 
provides a low-level layer which corresponds directly to 
the C-language BLAS standard, referred here as CBLAS, 
and a higher-level interface for operations on GSL vectors 
and matrices [llj . 

The CBLAS and CUBLAS implementations of the MP 
algorithm require the following functions/kernels: 

CBLAS: gsl_blas_sgemv, gsl_blas_dgemv 
CUBLAS: cublasSgemv, cublasDgemv 

• compute in single/double precision the matrix- vector 
product and sum: 

y =aAx + ,3 or y =aA T x + /3y. (15) 



3 



CBLAS: gsLblas_isamax, gsl_blas_idamax 
CUBLAS: cublaslsamax, cublasldamax 

• return the smallest index of the maximum magnitude 
element of the single/double precision vector x: 

to = argmax \x m \ . (16) 

CBLAS: int gsl_blas_saxpy, gsl_blas_daxpy 
CUBLAS: cublasSaxpy, cublasDaxpy 

• compute the single/double precision sum 

y = ax + y. (17) 

CBLAS: gsl_blas_snrm2, gsl_blas_dnrm2 
CUBLAS: cublasSnrm2, cublasDnrm2 

• compute the Euclidean norm of a single/double pre- 
cision vector: 



M 



\ m=0 



(18) 



These are the critical functions/kernels, which are effi- 
ciently exploited in the parallel CUBLAS implementa- 
tion. The other involved functions are for vector/matrix 
memory allocation and vector/matrix accessing, device 
(GPU) initialization, host-device data transfer and error 
handling. We stress once again that in the CUBLAS im- 
plementation the data space is allocated both on host 
(CPU) mainmemory and on device (GPU) memory. Af- 
ter the data is initialized on host it is transferred on de- 
vice, where the main parallel computation occurs. The 
results are then transferred back on host memory where 
the solution can be checked against the reference. The 
data transfer from host to device and back is not crit- 
ical for MP. For example the host to device dictionary 
transfer only takes place once, since all signals are en- 
coded/decoded with the same dictionary. The time for 
signal transfer from host to device and back is insignif- 
icant comparing to the device computation time. For 
convenience, we have included also a timer which mea- 
sures the time of all main operations involved by the 
MP algorithm. The double precision code for CBLAS 
and CUBLAS implementations is given in Appendix 1 
and respectively Appendix 2. These implementations 
can be easily modified in order to meet end user's needs. 
For example the single precision BLAS data allocation 
and vector/matrix accessing functions have the prefix 
gsLvectorJloat, gsLmatrixJloat etc. (see [9], pj] for de- 
tails). 



The numerical experiments have been carried out on the 
following system custom build by the author: CPU: AMD 
Phenom 9950 (2.6GHz); Motherboard: ASUS M3N78 
Pro (chipset GeForce m8300); RAM 8Gb DDR2 800MHz; 
On-board GPU: 8400GS, 512 Mb DDR2 (shared); PCI-E 
GPU: XFX GTX280, 1024 Mb DDR3; NVIDIA Linux 64- 
bit driver (177.67); CUDA Toolkit and SDK 2.0; Ubuntu 
Linux 64-bit 8.04.1, GNU Scientific Library v.1.11; Com- 
pilers: GCC (GNU), NVCC (NVIDIA). This system con- 
figuration gives the opportunity to evaluate the perfor- 
mance of the GPU code on two different GPUs. The first 
GPU (GPUO) is the core (G86) of the GeForce 8400GS, 
which is used as an entry level, on-board solution for the 
GeForce m8300 motherboard, it has 16 stream processors 
and 512 Mb DDR2 RAM (shared), and supports only sin- 
gle precision floating-point operations. The second GPU 
(GPU1) is the core (GT200) of GeForce GTX280, a high 
end solution with 240 stream processors and 1Gb DDR3 
RAM, which supports both single and double precision 
and it is theoretically capable of 1 Tflop computational 
power. 

In order to generate the same random dictionaries 
and signals, the seed of the random number generator 
(srandQ) was set to the same value in both CBLAS and 
CUBLAS implementations. We varied the length M of 
the signal and the size of the dictionary from M = 1000 
to M = 15000. The number of non-zero elements in 
the sparse signal was fixed to K — [0.07A/J (a = 0.07) 
and the number of projections was set to N — \M/2\ 
(p = 0.5). Also, the maximum number of iterations and 
the reconstruction error were set to T = N and respec- 
tively e = 10~ 7 . With these parameters, the MP algo- 
rithm is able to find the solution with an error e = 10~ 7 
in maximum TV iteration steps. 

In Figure 2 we give the results obtained for single and 
respectively double precision. It is interesting to note 
that for small dictionaries M < 2000 the CPU is actually 
faster than the GPUO. The gap between CPU and GPU1 
increases very fast by increasing the size of the dictionary. 
The GPU1 versus CPU speed reaches a maximum of 31 
times in single precision and respectively 21 times in dou- 
ble precision, for M = 15, 000. These results shows once 
again that GPU performance is dependent on the scale 
of the problem. Thus, in order to exploit efficiently the 
massive parallelism of GPUs and to effectively use the 
hardware capabilities, the problem itself needs to scale 
accordingly, such that thousands of threads are defined 
and used in computation. 

The current version of CUBLAS is not yet perfectly tuned 
comparing to the highly optimized CBLAS. This is re- 
vealed by using a logarithmic time scale representation 
of the previous results. In Figure 3 one can see that the 
CPU computation time is monotonously and smoothly 
increasing with the size of the problem (as expected) for 
the CBLAS code, while the GPU running time exhibits a 
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Figure 2: Numerical results for CPU and GPU computa- 
tion: a - CPU double precision; b - CPU single precision; 
c - GPU0 single precision; d - GPU1 double precision; e 
- GPU1 single precision. 

discontinuity around M — 8192, where the performance 
drops quite sharply. This discontinuity is present for both 
GPUs and also for both single and double precision imple- 
mentations. Thus, we conclude that it is not a hardware 
artifact but rather a CUBLAS implementation artifact. 

5 Conclusion 

In this paper we have presented a comparison between 
the CBLAS (CPU) and the CUBLAS (GPU) implemen- 
tations of the MP algorithm, a frequently used method to 
recover sparse signals from random projections. The MP 
algorithm efficiently exploits the high computing power 
of the GPU. For large problems, the GPU code is up to 
31 (21) times faster, in single (double) precision, than 
the CPU code. For example, the sparse reconstruction 
time for a signal of a length M = 16, 384 encoded with 
N = 8, 192 random vectors takes in average 1,900 s on 
a CPU, while the same decoding problem takes less than 
58 s on the GPU. This kind of performance improve- 
ment makes it possible to approach even more compu- 
tational hungry algorithms like the orthogonal version of 
MP, which is our next target for GPU implementation. 
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Appendix 1 

/* 

* Double precision CBLAS (GSL) implementation 

* of Matching Pursuit algorithm for 

* sparse signal recovery from random projections 
*/ 

/* Includes, system */ 
#include <stdio.h> 
#include <math.h> 
#include <time.h> 

/* Includes, GSL & CBLAS */ 
#include <gsl/gsl_vector .h> 
#include <gsl/gsl_matrix.h> 
#include <gsl/gsl_blas .h> 

/* Number of columns and rows in dictionary */ 
#define M (4000) 
#define N ( (int) (M/2) ) 

/* Number of non-zero elements in sparse signal */ 
int K = 0.07*M; 

/* Residual error */ 
double epsilon = 1.0e-7; 

/* Maximum number of iterations */ 
int T = N; 

/* Sign function */ 

double sign(double x){return (x>=0) - (x<0);} 

int main(int argc, char** argv) 
{ 

int n, m, k, t, q; 

double normi, normf , a , norm = sqrt(N), htime; 

printf ( ' ' \nProblem dimensions : 

NxM^/.dx'/.d, K=7.d", N, M, K) ; 

/* Initialize srand and clock */ 
srand (time (NULL)) ; 
clock_t start = clockO; 

/* Initiallize Bernoulli random dictionary */ 
gsl_matrix *D = gsl_matrix_alloc (N, M) ; 
for(m=0; m<M; m++) 
{ 

for(n=0; n<N; n++) 
{ 

a = sign(2.0*rand()/(double)RAND_MAX-l .0)/norm; 

gsl_matrix_set (D, n, m, a); 

} 

} 

/* Create a random K-sparse signal */ 
gsl_vector *x = gsl_vector_alloc(M) ; 
for(k=0; k<K; k++) 
{ 

gsl_vector_set(x, rand()7,M, 2.0*rand() 
/ (f loat)RAND_MAX - 1.0); 

} 



/* Allocate memory for solution */ 
gsl_vector *z = gsl_vector_calloc (M) ; 

/* Allocate memory for residual */ 
gsl_vector *r = gsl_vector_calloc (M) ; 

/* Allocate memory for the encoding vector */ 
gsl_vector *y = gsl_vector_alloc(N) ; 
htime = ((double)clock()-start)/CLOCKS_PER_SEC; 
printf (" \nTime for data allocation: %f'', htime); 

/* Encoding the signal */ 
start = clockO ; 

gsl_blas_dgemv(CblasNoTrans, 1.0, D, x, 0.0, y); 
htime = ((double)clock()-start)/CLOCKS_PER_SEC; 
printf (" \nTime for encoding: %f", htime); 

/* Decoding the signal */ 

start = clockO ; 

normi = gsl_blas_dnrm2(y) ; 

epsilon = sqrt(epsilon * normi); 

normf = normi ; 

t = 0; 

while (normf > epsilon && t < T) 
{ 

gsl_blas_dgemv(CblasTrans , 1.0, D, y, 0.0, r) ; 
q = gsl_blas_idamax(r) ; 
a = gsl_vector_get (r , q) ; 

gsl_vector_set(z, q, gsl_vector_get (z , q) + a) ; 
gsl_blas_daxpy(-a, 

&gsl_matrix_column(D, q) .vector, 

y); 

normf = gsl_blas_dnrm2(y) ; 

t++; 

} 

htime = ((double)clock()-start)/CLOCKS_PER_SEC; 

printf (" \nTime for decoding: 7,f ' ' , htime); 

a = 100.0*(normf *normf )/(normi*normi) ; 

printf (' '\nComputation residual error: °/,f ", a); 

/* Check the solution against the reference*/ 
printf (" \nSolution (first column), 

Reference (second column):''); 
getcharO ; 
for(m=0; m<M; m++) 
{ 

printf (' '\n'/,f \t'/,f '' , gsl_vector_get (x, m) , 
gsl_vector_get(z, m)); 

} 

normi = gsl_blas_dnrm2(x) ; 

gsl_blas_daxpy(-1.0, x, z) ; 

normf = gsl_blas_dnrm2(z) ; 

a = 100.0*(normf *normf )/(normi*normi) ; 

printf (' '\nSolution residual error: 7.f \n' ' , a); 

/* Memory clean up and shutdown*/ 
gsl_vector_f ree (y) ; 
gsl_vector_f ree(r) ; 
gsl_vector_f ree (z) ; 
gsl_vector_f ree(x) ; 
gsl_matrix_f ree(D) ; 
return EXIT_SUCCESS ; 
} 



Appendix 2 

/* 

* Double precision CUBLAS (NVIDIA) implementation 

* of Matching Pursuit algorithm for 

* sparse signal recovery from random projections. 
*/ 
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/* Includes, system */ 
#include <stdio.h> 
#include <stdlib.h> 
#include <string.h> 
#include <time.h> 

/* Includes, cuda */ 
#include <cublas.h> 

/* Number of columns and rows in dictionary */ 
#define M (2000) 
#define N ((int) (M/2) ) 

/* Number of non-zero elements in sparse signal */ 
int K = 0.07*M; 

/* Residual error */ 
double epsilon = 1.0e-7; 

/* Maximum number of iterations */ 
int T = N; 

/* Sign function */ 

double sign(double xMreturn (x>=0) - (x<0);} 

/* Matrix indexing convention */ 

#define id(m, n, Id) (((n) * (Id) + (m))) 

int main(int argc, char** argv) 
{ 

cublasStatus status; 

double *h_D, *h_X, *h_C, *c; 

double *d_D = 0, *d_S = 0, *d_R = 0; 

int MN = M*N, m, n, k, q, t; 

double norm=sqrt (N) , normi, normf , a, dtime; 

printf ( ' ' \nProblem dimensions : 

NxM='/,dx'/.d, K='/.d", N, M, K) ; 

/* Initialize srand and clock */ 
srand (time (NULL)) ; 
clock_t start = clockO; 

/* Initialize cublas */ 
status = cublaslnit () ; 
if (status != CUBLAS_STATUS_SUCCESS) 
{ 

f printf (stderr , 

''! CUBLAS initialization error\n''); 
return EXIT_FAILURE; 
} 

/* Initialize dictionary on host */ 
h_D = (double*)malloc(MN * sizeof (h_D [0] ) ) ; 
if(h_D == 0) 
{ 

f printf (stderr , "! host memory allocation error 

(dictionary) \n' ' ) ; 
return EXIT_FAILURE; 
} 

for(n = 0; n < N; n++) 
{ 

for(m =0; m < M; m++) 
{ 

a = sign(2.0*rand()/(double)RAND_MAX-l .0)/norm; 

h_D[id(m, n, M)] = a; 

} 

} 

/* Create a random K-sparse signal */ 
h_X = (double*) calloc(M, sizeof (h_X [0] )) ; 



if(h_X == 0) 
{ 

f printf (stderr , "! host memory allocation 

error (signal) \n' ') ; 
return EXIT.FAILURE; 
} 

for(k=0; k<K; k++) 
{ 

a = 2.0*rand()/(double)RAND_MAX - 1.0; 

h_X[rand()7.M] = a; 

} 

/* Allocate solution memory on host */ 
h_C = (double*) calloc(M, sizeof (h_C [0] )) ; 
if(h_C == 0) 
{ 

f printf (stderr , "! host memory allocation 

error (solution) \n' ') ; 
return EXIT_FAILURE; 
} 

c = (double*) calloc (1 , sizeof (c)); 
if(c == 0) 
{ 

f printf (stderr , "! host memory allocation 

error (c)\n' ' ) ; 
return EXIT.FAILURE; 
} 

dtime = ((double) clock () -start )/CL0CKS_PER_SEC; 
printf (" \nTime for host data allocation: 

%f", dtime); 
start=clock() ; 

/* Host to device data transfer: dictionary */ 
status = cublasAlloc(MN, sizeof (d_D [0] ) , 

(void**)&d_D) ; 
if (status != CUBLAS_STATUS_SUCCESS) 
{ 

f printf (stderr , "! device memory allocation 

error (dictionary) \n' ') ; 
return EXIT.FAILURE; 
} 

status = cublasSetVector(MN, sizeof (h_D [0] ) , 

h_D, 1, d_D, 1); 
if (status != CUBLAS_STATUS_SUCCESS) 
{ 

f printf (stderr , "! device access error 

(write dictionary)\n' ' ) ; 
return EXIT.FAILURE; 
} 

/* Host to device data transfer: signal */ 
status = cublasAlloc(M, sizeof (d_R [0] ) , 

(void**)&d_R) ; 
if (status != CUBLAS_STATUS_SUCCESS) 
{ 

f printf (stderr , "! device memory allocation 

error (signal) \n' ') ; 
return EXIT.FAILURE; 
} 

status = cublasSetVector(M, sizeof (h_X [0] ) , 

h_X, 1, d_R, 1); 
if (status != CUBLAS_STATUS_SUCCESS) 
{ 

f printf (stderr , "! device access error 

(write signal)\n' ' ) ; 
return EXIT.FAILURE; 
} 

status = cublasAlloc(N, sizeof (d_S [0] ) , 

(void**)&d_S) ; 
if (status != CUBLAS_STATUS_SUCCESS) 
{ 

f printf (stderr , "I device memory allocation 
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error (projected vector) \n' ') ; 
return EXIT_FAILURE; 
} 

dtime = ( (double) clockO -start)/CLOCKS_PER_SEC ; 
printf ( ' ' \nTime for Host to Device data transfer: 
7,f (s)", dtime); 

/* Encoding the signal on device*/ 
start = clock () ; 

cublasDgemvCt' , M, N, 1.0, d_D, M, 

d_R, 1, 0.0, d_S, 1); 
status = cublasGetError () ; 
if (status != CUBLAS_STATUS_SUCCESS) 
{ 

fprintf (stderr , "! kernel execution 

error (encoding) \n' ' ) ; 
return EXIT_FAILURE; 
} 

dtime = ( (double) clockO -start)/CLOCKS_PER_SEC ; 
printf (' '\nTime for encoding: 7,f (s) ' ' , dtime); 

/* Decoding the signal on device*/ 
start = clockO ; 

normi = cublasDnrm2 (N, d_S, 1); 
epsilon = sqrt (epsilon*normi) ; 
normf = normi ; 
t = 0; 

while (normf > epsilon && t < T) 
{ 

cublasDgemvCn' , M, N, 1.0, d_D, M, 

d_S, 1, 0.0, d_R, 1); 
q = cublasldamax (M, d_R, 1) - 1; 
cublasGetVector(l , sizeof(c), 

&d_R[q], 1, c, 1); 
h_C[q] = *c + h_C[q] ; 

cublasDaxpy (N, -(*c) , &d_D[q], M, d_S, 1); 

normf = cublasDnrm2 (N, d_S, 1); 

t++; 

} 

status = cublasGetError () ; 
if (status != CUBLAS_STATUS_SUCCESS) 
{ 

fprintf (stderr , "! kernel execution 

error (decoding) \n' ') ; 
return EXIT_FAILURE; 
} 

dtime = ( (double) clockO -start)/CL0CKS_PER_SEC; 
printf (' '\nTime for decoding: %f (s) ' ' , dtime); 
a = 100 . 0* (normf *normf )/(normi*normi) ; 
printf (' '\nComputation residual error: 7,f ' ' , a); 

/* Check the solution */ 

printf (' '\nSolution (first column), 

Reference (second column) : ' ' ) ; 
get char () ; 
for(m=0; m<M; m++) 
{ 

printf ("\n"/.f\t'/.f h_C[m], h_X[m]); 
} 

normi = 0; normf = 0; 
for(m=0; m<M; m++) 
{ 

normi = normi + h_X [m] *h_X [m] ; 
normf = normf + 

(h_C[m] - h_X [m] ) * (h_C [m] - h_X[m]); 

} 

printf (" \nSolution residual error: 
7,f ' ' , 100 . 0*normf /normi) ; 

/* Memory clean up */ 
free(h_D) ; 
free(h_X) ; 



free(h_C) ; 

status = cublasFree (d_D) ; 
status = cublasFree (d_S) ; 
status = cublasFree (d_R) ; 
if (status != CUBLAS_STATUS_SUCCESS) 
{ 

fprintf (stderr, 

"! device memory free error\n''); 
return EXIT.FAILURE; 
} 

/* Shutdown */ 
status = cublasShutdownO ; 
if (status != CUBLAS_STATUS_SUCCESS) 
{ 

fprintf (stderr, 

"! cublas shutdown error\n''); 
return EXIT_FAILURE; 
} 

if(argc<=l || strcmp(argv[l] , ' '-noprompt' ')) 
{ 

printf ("\nPress ENTER to exit...\n"); 

get char () ; 

} 

return EXIT_SUCCESS ; 
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